MUSA 5000 Homework 2
Using Spatial Lag, Spatial Error and Geographically Weighted Regression to Predict Level of Investment in Public Housing RAD Conversions
Introduction
The Rental Assistance Demonstration was enacted in 2012 after the U.S. Congress directed the Department of Housing & Urban Development to demonstrate and then subsequently expand the process of restructuring the legal agreements associated with the nation’s public housing portfolio (Department of Housing & Urban Development, n.d.) (Department of Housing & Urban Development, n.d.). The process of “conversion” removes public housing properties from the Section 9 Low Rent Housing Program (otherwise known as U.S. public housing) created under the Housing Act of 1937, applies a new (less restrictive) use agreement and facilitates the transfer of the properties and/or new Section 8 project-based funding stream to a range of different public and private partnerships. These entities would then be able to solicit investment and mortgages on the former public housing properties, while taking steps to ensure the property’s immediate physical needs are met and there is sufficient reserves and cash flow for long-term needs.
Like Report #1, this paper utilizes over 10 years of administrative data from the RAD program and the public housing program to gain insight into whether the measure of physical deficiency of public housing sites is a significant driver of the level of immediate construction investment received following the removal of the property from the public housing inventory. Alternatively, program outcomes may be explained by financial marketability of the property and neighborhood. The other variables considered include the region’s average market rent, the number of smaller bedroom units, the size of the public housing site, the percentage of units converted under the program and the initial Section 8 rental assistance subsidy offered, and a series of demographic and economic characteristics of the surrounding neighborhood.
This paper departs from Report #1 in a few important ways. Whereas the previous report focused on the 48 contiguous states and the District of Columbia, this paper narrows the geographic coverage to 13 states in the South/Southeastern part of the United States. In reviewing previous program evaluation, I learned that southern states have participated much more frequently in the program (52% of all units converted through December 2023 when the included data was retrieved) (Stout et al., 2019). They are preliminary indications that the nature (in addition to extent) of their utilization of the program are different from non-Southern states, like in the conversion of public housing sites with no physical renovation (“paper conversions”).
The study area and enclosed public housing properties that have converted under RAD is illustrated below. There are 544 properties that will be used in the pursuant analysis, with the exception of the Local Indices of Spatial Autocorrelation in which 58 sites (outlined in red) are dropped given the lack of neighbors.
In Report #1, I carried out Ordinary Lease Squares (OLS) regression to examine the relationship between the dependent variable of hard cost spending per unit and 8 independent variables whereas this report utilizes 10 independent variables.
Dependent Variable: \(hardcost\_per\_unit\) = The total hard construction costs for RAD projects matched to the public housing site divided by the total number of units (RAD or not). This is the dependent variable in the model specification, meaning that it’s variation will be explained by the variation of all other variables
| Report #1 Predictor | Report #2 Predictor | Description |
|---|---|---|
| inspection_score | latest_score | Recalculated as most recent score instead of average of multiple scores |
| Total.PIC.Units | Total.PIC.Units | No change (number of public housing units on-site in 2012) |
| n/a | PUM.Contract.Rent | Added a measure of the initial subsidy guaranteed for each converted public housing site at the outset of the RAD program. |
| FMR.2.BR | FMR.2.BR | No Change (HUD-determined Fair Market Rent for 2BR units) |
| pct_01br | pct_01br | No Change (Percentage of 0 and 1BR units at public housing site) |
| n/a | pct_converted | Added the percentage of pre-RAD units converted |
| pct_black | pct_black | No Change (Tract black population percentage) |
| pct_owners | pct_owners | No Change (Tract ownership rate) |
| pct_vacant | pct_vacant | No Change (Tract housing vacancy rate) |
| med_hvalue | n/a | Removed census tract median home value in order to include all public housing sites |
| n/a | urbanicity | 2010 urbanicity measure from Department of Agriculture with 1-3 as urban and 4-10 as rural |
OLS procedures are often not appropriate when dealing with spatial data sets. Real estate is notoriously local, so nearby market comps for this novel transaction are theoretically plausible to have an impact holding all else constant. In addition, the participating public housing authorities (PHAs) cover distinct geographic territories.
This paper improves on earlier statistical procedures by accounting for the spatial nature of public housing properties in the 13 states in the study area. More specifically, it will use spatial log, spatial error and geographically weighted regression to determine if these spatially-considerate methods perform better than OLS.
Methods
Data Cleaning and Preparation
Report #1 identified a practical limitation in the administrative data being analyzed which undercut the original analysis. While the study sought to link “destination” RAD transactions to “sending” public housing sites using the identifying PIC # which is used within all data sets, it was discovered that in a non-insignificant number of conversions, a destination transaction can have multiple sending sites and sending site can have multiple destinations. When the former happens, all but one of the identifying PIC # for the sending sites is deprecated and it is not possible to fully know which sites have converted units. This issue was rectified by the provision of a “crosswalk” from the Department of Housing & Urban Development which trace each conversion “dyad” with each instance of a sending site and a destination transaction. This led to the identification of approximately 130 more sites where conversions had taken placed, and the ability to calculate a weighted average for sending sites that were linked to multiple RAD transactions with different hard cost budgets. There were also approximately 174 sites in Report #1 that did not have a census tract median housing value and were dropped from the analysis. Instead, this variable was dropped from the analysis to maximize the national public housing sites to 1332 total sites and 1246 sites with location information that could be drawn from in defining the study area for this study.
A Description of the Concept of Spatial Autocorrelation
In 1970, Waldo Tobler invoved the first law of geography that, “Everything is related to everything else, but near things are more related than distant things” (Tobler, 1970, pp. 236–237). Tobler articulates the basic premise behind all spatial statistics, that the relationship between objects of interest strengthens with their proximity. In the context of public housing, practically-speaking, it would be extremely unlikely to observe two adjacent properties with radically different housing conditions as a result of general maintenance and upkeep. One thing done to one property would likely be done to the other, and the set of external social conditions and social capital that impact one property would be identical to what impacts the other. The existence of spatial dependencies and the desire to understand the spatial form of autocorrelation (in which statistical variables are related with themselves in some form) motivates the study of spatial relationships and the definition of various “neighbor” relationships like queens neighbors (which can move in any direction on a chess board, including to spaces that are connected by a single point), rook neighbors (which can only move to spaces which share a segment) and distance neighbors which are simply within 500 feet or 25 miles of each other. There are three main methods and procedures used within this paper to assess spatial autocorrelation.
Moran’s I
Moran’s I has been a form of measuring spatial correlation since the 1950, and it is likely the most widely adopted measure. It measures whether similar values cluster together (positive autocorrelation) or are more dispersed (negative autocorrelation). It is largely synonymous to the calculation of the correlation coefficient which measures the direction and strength between two variables where there their respective differences from the mean are multiplied and divided by the sum of squared differences from the mean. With Moran’s I, the product of the difference of means in the numerator is modified by a weight matrix that reflects the relevant observation to observation neighbor relationship. If a basic queen or rook neighbor relationship is used or a distance threshold smaller than the geographic extent of the data, the weight matrix is likely 0 for several observations not adjacent to each other. The specific equation for Moran’s I ($I$) is provided in Equation 1.
\[ \begin{aligned} (Eq. 1): I= &\frac{n}{\sum_{i=1}^n \sum_{j=1}^n w_{ij}}\frac{\sum_{i=1}^n \sum_{j=1}^n w_{ij}(X_i - \bar{X}) (X_j - \bar{X})}{\sum_{i=1}^n (X_i - \bar{X})^2} \end{aligned} \]
\[ \begin{aligned} I &= \text{Moran's I} \\ n &= \text{Number of observations, whether points or shapes} \\ w_{ij} &= \text{weight indexing location of i relative to j} \\ \bar{X} &= \text{Number of observations in the samplel} \\ X_i &= \text{Variable value at location i} \\ X_j &= \text{Variable value at location j} \\ \end{aligned} \]
For the purpose of this analysis, I am using a weight matrix associated with the neighbor definition of 25 miles, or 40,234 meters in the map projection of meters for WGS 84 EPSG:3857. In preliminary analysis, weight matrices were considered for 1 mile to 125 miles and 25 miles was selected as a result of considering spatial relationships that were coherent with the data. 25 miles was likely to encompass a city, whereas 125 miles didn’t have an apparent theoretical justification for two properties having a spatial relationship. The following map illustrates the neighbor relationships in they study area associated with 25 miles.
It is noted that 125 miles did show evidence of continuing spatial autocorrelation. Subsequent insight after completion of this paper’s analysis did indicate that (a) the R script was incorrectly specified during the initial examination of multiple weight matrices which created the illusion that 25 miles was significant and (b) it is generally a better practice to select the highest distance threshold where spatial autocorrelation is still present. Statisticians may also conduct the analysis with multiple weight matrices to avoid having current theoretical justifications determine the scope of the analysis and the results, avoiding confirmation bias. As an example, HUD provides direct oversight of public housing and conducts the REAC inspections in the data set, and has regional offices which may provide a theoretical motivation for significant neighbor relationships over 125 miles.
Hypothesis Testing
Moran’s I is an individual calculation, but is used in significant testing with the deployment of a random permutations of a Monte Carlo simulation. The observed variable is randomly assigned across the differently spaced spatial points, with repeated calculations of the Moran’s I (999 times in the case of this paper). This generates a particular random distribution that the observed Moran’s I can be measured against, resulting in ranked ordering of Moran’s I values that can be divided by the number of permutations to generate a pseudo p-value that can be measured against \(\alpha=5\%\). The null hypothesis is that there is no spatial autocorrelation and that the observed Moran’s I is within acceptable bounds of random variation. The alternative hypothesis is that there is autocorrelation and that observed Moran’s I is sufficiently unlikely if null hypothesis were true.
Local Indices of Spatial Autocorrelation (LISA)
The LISA procedure determines to what extent do sites in vicinity of a point \(i\) have values that are associated with the value observed at point \(i\). It creates a Moran’s I at every specific location, which is then plotted based on whether each point is positively or negatively spatially autocorrelated with its neighbors. When high values of the observation are associated with high values of it’s neighbors, it falls on the top-left quadrant of the LISA cluster map. When low values are associated with low values of neighbors, these points are plotted on the bottom-left quadrant and so forth. This quadrant classifications will be reflected on a map to illustrate the spatial relationship.
Hypothesis testing is done by doing many permutations in which all points except for the local value are shuffled, and the local Moran’s I is re-calculated. Similar to the global Moran’s I, these local calculations of Moran’s I are ranked for each point, with the rank order used to calculate a pseudo p-value that can be assessed using an threshold such as \(\alpha=5\%\). This p-value can also be mapped using the dataset’s geometry as a visual examination of where spatial autocorrelation is.
Because the weight matrix of 25 miles permits observations without neighbors, 58 observations were ignored in order to perform this LISA calculations in R.
A Review of OLS Regression and Assumptions
There are several main assumptions of OLS regression. A linear relationship is assumed between y and x. Residuals are assumed to be distributed normally, and independent with no relationship between the residual and any of the independent variables. There should also be limited correlation between independent variables, which can reduce the predictive power of both variables. Observations are also assumed to be independent of each other. Lastly, there should be at least 10 observations in the sample size per independent variable. For more information on OLS regression, see see Report #1 and the Appendix.
When data has a spatial component, like public housing sites distributed across southern U.S. states, the assumption that residuals are independent often doesn’t hold. We can test the assumption of independent residuals through a number of different approaches. First, we can calculate the Moran’s I for the residuals (which relies on examining the values of the neighboring residuals, in this case public housing sites within 25 miles). Then we can do the associated significance testing based on Monte Carlo permutations described above. Second, we can regress the OLS residuals on these same nearby residuals. This approach takes the value of the residual at a point, and the mean of the residuals of all it’s neighbors, pairs these as XY-coordinates, and inputs them into a standard OLS model using the minimization of the squared difference between each residual and the mean of it’s neighboring residuals. The slope of the \(\beta\) for the scatter plot would be the coefficient of the neighboring residuals and reflect the change in a specific point’s residual based on a 1-point increase in the mean average of the residuals for it’s neighboring poins.
For this analysis, we will use the R statistical package. R has tests for other regression assumptions outside of independent of residuals. The first assumption is homoscedasticity, which holds that residuals (in addition to not being independent of themselves) do not have a relationship with other independent variables. There are 3 tests in R that examine data for the purpose of testing this assumption:
The Breusch-Pagan Test (Breusch & Pagan, 1979)
The Koenker-Bassett Test (also known as the Studentized Breusch-Pagan Test) (Koenker & Bassett, 1982)
The White Test (White, 1980)
All three tests have a null hypothesis of homoskedasticity (absence of a relationship between the residuals and the independent variables). The alternative hypothesis is heteroskedasticity, that the residuals do vary differently based on the independent variable.
Another assumption that can be examined using R is normality of errors. The Jarque-Bera Test to assess whether residuals are normal (Jarque & Bera, 1980). The null hypothesis of this test is that the residuals are normally distributed and the alternative hypothesis is that they are non-normal.
Spatial Lag and Spatial Error Regression
I will be using R to run spatial lag and spatial error regressions on the data. In this section, I describe these two spatial regression methods, the goals and associated model diagnostics.
Spatial Leg Model (SLM)
Spatial lag regression is a nested form of OLS regression that includes a spatially lagged variable for the independent variable, meaning the mean value observed at the neighboring observations for each observation subject to the specified weight matrix. This model assumes that the value of the dependent variable at one location is associated with the values of these nearby locations.
The regression model is demonstrated in Equation #2:
\[ \begin{aligned} (Eq. 2) hardcost\_per\_unit = & \rho * W y + \\ &\beta_0 + \beta_1 * inspection\_score + \beta_2 * FMR.2.BR + \\ &\beta_3 * log(Total.PIC.Units) + \beta_4 * PUM.Contract.Rent + \\ &\beta_5 * pct\_01br + \beta_6 * pct\_converted \\ &\beta_7 * pct\_black + \beta_8 * pct\_owners + \\ &\beta_9 * pct\_vacant + \beta_10 * urbanicity + \epsilon \end{aligned} \]
- \(\beta_0 \dots \beta_{10}\): Variable coefficients are the same as in Report #1 (with the exception of two new variables, PUM.Contract.Rent, pct_converted, and urbanicity) and are defined in the Appendix of this report.
- \(\rho\): This is similar to the coefficients on the OLS variables but reflects the amount of spatial autocorrelation in the dependent variables (ranging from perfect negative spatial autocorrelation of -1 to perfect positive spatial autocorrelation of 1).
- \(Wy\): This is the additional lagged predictor \(Wy\) added to the model, reflecting the (mean) value of the dependent variable’s neighbors.
Spatial Error Regression
Spatial error regression operates on the assumption that the residual at one location is associated with the residuals at neighboring locations. It performs a first-step regression typical of OLS, then regresses the residual term on the mean of the neighboring residuals. It takes the spatial information hat remains in the residual term and splits it up into a portion that is a spatial relationship (akin to calculating the Moran’s I) and a portion that is random noise. The coefficient generated is constrained between -1 and 1.
The spatial error regression model is demonstrated in Equation #3:
\[ \begin{aligned} (Eq. 3) hardcost\_per\_unit = &\beta_0 + \beta_1 * inspection\_score + \beta_2 * FMR.2.BR + \\ &\beta_3 * log(Total.PIC.Units) + \beta_4 * PUM.Contract.Rent + \\ &\beta_5 * pct\_01br + \beta_6 * pct\_converted \\ &\beta_7 * pct\_black + \beta_8 * pct\_owners + \\ &\beta_9 * pct\_vacant + \beta_10 * urbanicity + \\ & \lambda * W \epsilon + u \\ \end{aligned} \]
- \(\beta_0 \dots \beta_{10}\): Coefficients on main predictors have the same interpretation as the spatial lag model.
- \(\epsilon\): The residual (unobserved variation in the data) in the first-stage OLS model, which also serves as the independent variable of the second-stage regression.
- \(\lambda\): The observed spatial autocorrelation of the residuals of the model, i.e., how much residuals in one location are correlated with the residuals of their neighbors. It is bounded by -1 (perfect negative spatial autocorrelation) and 1 (perfect positive autocorrelation).
- \(W \epsilon\): The (mean) residual of the neighboring points to the independent variable of the second-stage model.
- \(u\): The residual of the second-stage model, representing random noise in the model once autocorrelation has been accounted for.
Model Diagnostics
All assumptions identified for OLD are still needed in spatial lag and spatial error regression with the exception of spatial independence of observations. The goal of these two models is to account for spatial autocorrelation, but they do so in a different fashion. The spatial lag model incorporates spatial structure with neighbors in the dependent variable predicted by the model, useful in instances where observation impact nearby ones (in the case of externalities of the dependent variable). The spatial error model, in contrast, signals the presence of unobserved (spatial) predictors accounting for a portion of the residual. In either case, a significant \(\rho\) or \(\lambda\) coefficient indicates the presence of spatial autocorrelation.
I will compare the results of both the spatial lag and spatial error regression models with OLS to decide which spatial models are an improvement. There are three criteria that will be used:
- Akaike Information Criterion (AIC)/Schwarz Criterion
(SC): AIC and SC measure the goodness of fit of an estimated
model, assessing how much information is lost when a model seeks to
capture reality. They balance the tradeoff between a more precise model,
and one that is more complex. Models with lower scores on these criteria
fit better.
- Log Likelihood: This measure relies on the maximum likelihood approach to estimating parameters in which coefficients are chosen such that the observed data has the greatest chance of occurring. It is a nested model and, thus, can only be used to compare spatial lag model or spatial error model to OLS, not to each other. In other words, it determines whether the additional predictors (the spatial predictors) reflect a significant improvement over OLS in fitting the data in our dataset. This measure is always negative (it’s the log of a percentage) and when the spatial model is higher (less negative) it is considered an improvement.
- Likelihood Ratio Test: This measure converts Log Likelihood into a hypothesis test where the null hypothesis is that one of the spatial models is not better than OLS (and we should not continue use of them in this instance) and the alternative hypothesis is that they are better specifications for the observed data. When the p-value is lower than a specified threshold (typically \(\alpha=5\%\)) than we reject the null.
Another way of comparing spatial lag and error results to OLS is by calculating the Moran’s I of their regression residuals. The introduction of either model should lessen the spatial autocorrelation when compared to the original OLS. The model which has less spatial autocorrelation (i.e. a Moran’s I closer to 0) would more effective in accomplishing teh goal. If one or both models does not significantly reduce the Moran’s I of the residual, they are either not capturing the nature of the of spatial relationship or the weight matrix may need to be adjusted.
Geographically Weighted Regression
We will conduct a geographically weighted regression (GWR) using the R statistical package. GWR can perform better because it allows for variation in the spatial relationship across the plan of interest. The relationship between the physical condition of a public housing property and the amount of investment it receives may be different in rural Mississippi than it is metro Atlanta. Simpson’s Paradox refers to the fact that you can have 4 locally bounded groups of points with relationships between two variables that suggest an overall negative relationship when viewed as a unit, but when you examine them with their immediate neighbors you may see that 3 of the 4 of them have a positive relationship within their neighbor group, but their group has a negative relationship with other groups of observations. The solution of GWR is condduct a series of local regressions for each point with all the neighboring points.
The equation for GWR is illustrated in Equation #4. It articulates a regression for a specific point \(u\), and a calculation of the coefficients \(\beta\) at that point using matrix algebra. A covariance-variance matrix is computed using the independent variables and their neighbors and with the independent variables and the neighbors of the dependent variables. The matrix product of these two matrices is what delivers the vector of coefficients.
$$ \[\begin{aligned} (Eq. 4) hardcost\_per\_unit_i (\textbf{u})= &\beta_{i0}(\textbf{u}) + \beta_{i1}(\textbf{u}) * inspection\_score + \beta_{i2}(\textbf{u}) * FMR.2.BR + \\ &\beta_{i3}(\textbf{u}) * log(Total.PIC.Units) + \beta_{i4}(\textbf{u}) * PUM.Contract.Rent + \\ &\beta_{i5}(\textbf{u}) * pct\_01br + \beta_{i6}(\textbf{u}) * pct\_converted \\ &\beta_{i7}(\textbf{u}) * pct\_black + \beta_{i8}(\textbf{u}) * pct\_owners + \\ &\beta_{i9}(\textbf{u}) * pct\_vacant + \beta_{i10}(\textbf{u}) * urbanicity + \epsilon \\ = &\beta_{i0} + \sum_{k=1}^m {\beta_{ik} x_{ik}} + \epsilon_i \\ \text{Where}& \\ \hat{\beta}(\textbf{u}) &= (X^T W(\textbf{u}) X)^{-1} * (X^T W(\textbf{u}) y) \end{aligned}\]$$
- \(i\): Subscript i describes coefficients and observations at a specific point where one local regression will be completed.
- \(\beta_{i0}(\mathbf{u}) \dots \beta_{i10}(\mathbf{u})\): Coefficients for the m=10 predictors for the local regression at point \(\mathbf{u}\) performed using the neighboring observations.
- \(x_{ik}\): Observations for the m=10 predictors for neighboring observations to point i.
- \(\lambda\): The observed spatial autocorrelation of the residuals of the model, i.e., how much residuals in one location are correlated with the residuals of their neighbors. It is bounded by -1 (perfect negative spatial autocorrelation) and 1 (perfect positive autocorrelation).
- \(\epsilon_i\): The residual for the local regression performed at point i using the neighboring observations.
- \(\hat{\beta}(\mathbf{u})\): Vector of estimated parameters at point \(\mathbf{u}\).
- \(X\): Matrix with all independent variables preceded by a column of 1’s for the intercept term.
- \(y\): Vector of observed dependent variable values.
- \(W(\mathbf{u})\): Weight matrix at point \(\mathbf{u}\) that can be used to weight y or X.
- \(M^T\): Transposed form of matrix M.
- \((X^T X)^{-1}\): Inverse of the variance-covariance matrix.
Under GWR, there will be a regression for the number observations (i) that are in the data and closer observations will be given greater weight. To run a regression, you need multiple points. There are two ways to weight nearby observations. With fixed bandwidth, you set a specific threshold and beyond that threshold points receive a zero weight. This is best for distributions that are fairly stable and uniform across the plane. Adaptive bandwidth sets a target number of observations and goes however far out from a observation is needed to locate that number of neighbors. It is best for when the distribution varies across the plane, including clustered data or non-uniform polygons.
In this study, we will be using adaptive bandwidth which makes sense because the data is not uniform. There are points particularly isolated from other points (one converted property far from others), and points that are clustered (PHAs that have converted multiple properties in their jurisdiction). Consistent with the use of spatial analysis, this approach can (in part) account for the effects of the local implementation. The R package \(spgwr\) will dynamically set the threshold as a percentage of all points in the data set as the basis of GWR.
Assumptions for OLS ares still valid for GWR, with the exception of the spatially independent observations. Multicollinearity and clustering of similar variables can present a particular challenge for GWR. Variables that are significantly redundant will be repeated across all the regressions. Because GWR is building local regressions, when a variable is similar in a clustered way, it will undercut the variability across the model and results will be unreliable. There is also a problem that blends multicollinearity and clustering where two variables are very high in one clustered region and also very low in another clustered region.Dummy variables corresponding to spatial relationships (in the city limits, in the suburbs, etc.) are also not encouraged in GWR because they can create multicollinearity given that they are duplicative of the way the model allows independent variables to vary based on their location. All these instances reduce the reliability of the model results and we can look to the condition number to determine how significant the issue is. When the condition number is either higher than 30, or equal or close to negative infinity, we know the GWR results are unreliable.
After the model has been completed it will not provide p-values. There is an issue of multiple testing because there is a \((m+1)*n\) number of significant tests where m is the number of predictors and n is the number of observations. An \(\alpha=5\%\) would be interpreted as having 5% of those tests come back as significant due to random chance, which would be a problem. We can, however, compute a local statistic by dividng the coefficient by it’s standard error and using the t* critical score of 1.96 to approximate if a predictor is significantly different from 0 at a specific observation.
Results
In this section, I present the analysis of spatial autocorrelation, and the results of the Ordinary Least Square, Spatial Lag Model, Spatial Error Model and Weighted Geographical Regressions.
Spatial Autocorrelation
The Global Moran’s I for the dependent variable of hard cost spending unit is provided below, along with the lagged residual scattered plot and a calculation of the Moran’s I pseudo-p-value based on 999 permutations.
##
## Monte-Carlo simulation of Moran I
##
## data: x
## weights: wfile
## number of simulations + 1: 1000
##
## statistic = 0.28006, observed rank = 1000, p-value <
## 0.00000000000000022
## alternative hypothesis: two.sided
The Moran’s I of 0.28 indicates slight spatial autocorrelation, and the positive slope on the fitted line on the lagged residual scatter plot and the very low p-value of the Monte Carlo simulation confirms that there is spatial autocorrelation in the dependent variable.
The Local Moran’s I is calculated leading to 544 rows of which 5 are sampled below.
| Ii | E.Ii | Var.Ii | Z.Ii | Pr(z != E(Ii)) | |
|---|---|---|---|---|---|
| 16 | 0.2049788 | -0.0083160 | 0.6338022 | 0.2679190 | 0.7887617 |
| 18 | 0.2618243 | -0.0020034 | 0.1536579 | 0.6730431 | 0.5009199 |
| 25 | 0.2676632 | -0.0022516 | 0.1726558 | 0.6495852 | 0.5159602 |
| 37 | -1.0540611 | -0.0043888 | 1.1863086 | -0.9637290 | 0.3351818 |
| 45 | -1.9630299 | -0.0139750 | 2.4894898 | -1.2352899 | 0.2167226 |
| 46 | 0.1449267 | -0.0083028 | 0.5526714 | 0.2061147 | 0.8367013 |
The Moran’s I table has five outputs which are interpreted below:
- \(Ii\): The local Moran’s I for that observation
- \(E.Ii\): The expected local Moran’s I without spatial autocorrelation
- \(Var.Ii\): The local variance of the local Moran’s I
- \(Z.Ii\): The z-score or standardized form of the local Moran’s I
- \(Pr(z != E(Ii))\): The p-value associated with the local Moran’s I at this observation
The following LISA Quadrant map highlights observations whose local Moran’s I is significantly spatially correlated with it’s neighboring values, and whether high and low values are associated with each other. It also indicates points that are not significant, and that have no neighbors and thus no local Moran’s I.
The following map shows the pseudo p-values calculated for every point, indicating where spatial auto-correlation is.
- Discuss the results: what are the not significant, high-high, high-low, low-high and low-low areas on the Cluster Map? Where in the city are these areas?
In the quadrant map, we see significant clusters in Charlotte, NC (High-High), Richmond, VA (Low-Low), Biloxi, MS (Low-Low), Little Rock, Arkansas (Low-Low), Northern Virginia (High-High), a few in Atlanta, GA (Low-Low) and in El Paso, TX (High-High). Most frequently, low values of hard cost spending per unit are surrounded by higher values. This is validated by the p-value map, with some additional areas that exhibit local spatial autocorrelation such as Memphis, TN and Nashville, TN.
A Review of OLS Regression and Assumptions: Results
Results from the OLS regression of the hardcost_per_unit on 10 predictors is provided below in Table 1.
Table 1: Ordinary Least Squares Regression Results**
##
## Call:
## lm(formula = hardcost_per_unit ~ latest_score + lTotal.PIC.Units +
## PUM.Contract.Rent + FMR.2.BR + pct_01br + pct_converted +
## pct_black + pct_owners + pct_vacant + urbanicity, data = rad_south_sf_3857)
##
## Residuals:
## Min 1Q Median 3Q Max
## -101441 -26846 -12350 18137 227349
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46614.64 26594.57 1.753 0.08021 .
## latest_score -508.21 156.60 -3.245 0.00125 **
## lTotal.PIC.Units 6940.65 2646.21 2.623 0.00897 **
## PUM.Contract.Rent 68.45 26.85 2.549 0.01107 *
## FMR.2.BR -13.17 14.38 -0.916 0.35992
## pct_01br 8803.39 6836.67 1.288 0.19842
## pct_converted -59740.06 6753.49 -8.846 < 0.0000000000000002 ***
## pct_black 18735.31 7402.23 2.531 0.01166 *
## pct_owners 4642.17 12055.32 0.385 0.70034
## pct_vacant 16692.01 25661.53 0.650 0.51567
## urbanicity 10452.45 5506.01 1.898 0.05819 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44470 on 533 degrees of freedom
## Multiple R-squared: 0.2534, Adjusted R-squared: 0.2393
## F-statistic: 18.09 on 10 and 533 DF, p-value: < 0.00000000000000022
## [1] "Results of the Log Likelihood Calculation:"
## 'log Lik.' -6588.56 (df=12)
There are five variables are statistically significant at \(\alpha=5\%\): the latest inspection score of the public housing site, the number of public housing units at the site, the initial subsidy offered to converting sites, the percentage of pre-program units that have been converted and the black percentage of the census tract. Four variables are not close to being significant: the FMR for a 2BR unit, the percentage of 0-1BR units, the homeownership rate and the percentage of vacant housing in the census tract. Urbanicity is close to being significant, but not given the \(a\) level. The \(R^2\) of 0.2534 indicates that 25% of the variation in the hard cost spending per unit is accounted by changes in these 10 predictors. Report #1 dealt with data from a larger study area, and didn’t adequately account for how destination RAD transactions mix and match sending public housing sites. However, the current model reflects a slight improvement in overall explanation of changes in the dependent variable.
The distribution and standardized residual plot are provided for the OLS model below. They indicate normality, but a clear pattern in relationship to the predicted values of the dependent variable.
Base OLS Model Plots
In addition to a visual examination, the Breusch-Pagan Test, Koenker-Bassett Test (Studentized Breusch-Pagan Test) and the White Test all show p-values below \(\alpha=5\%\) which lead to rejection of the null hypothesis of homoskedasticity. This supports the appearance of heteroskedasticity in the plots.
##
## Breusch-Pagan test
##
## data: reg
## BP = 59.202, df = 10, p-value = 0.00000000513
##
## studentized Breusch-Pagan test
##
## data: reg
## BP = 22.82, df = 10, p-value = 0.01143
## White's test results
##
## Null hypothesis: Homoskedasticity of the residuals
## Alternative hypothesis: Heteroskedasticity of the residuals
## Test Statistic: 89.99
## P-value: 0
The Jarque-Bera test shows that the null hypothesis of normally distributed residuals of the OLS regression is also rejected in favor of non-normality. This demonstrates how a visual examination is often inadequate because the histogram of the residuals is not obviously non-normal.
##
## Jarque Bera Test
##
## data: reg$residuals
## X-squared = 406.54, df = 2, p-value < 0.00000000000000022
A map of the OLS standardized residuals is included, followed by a scatter plot of the residuals against their lagged neighbors and the group of Moran’s I calculations and Monte Carlo significance test. The map doesn’t immediately show a pattern as more extreme residuals are surrounded by less extreme residuals. The scattered plot of the lagged residuals has a positive slope, with a Moran’s I of .19 that is significant against 999 permutations. This indicates significant spatial autocorrelation at the weight matrix of 25 miles. This is suggests that nearby public housing sites with RAD conversions see more similar hard cost spending in associated transactions than two random sites selected from the data set, and that OLS procedures will be unreliable.
##
## Monte-Carlo simulation of Moran I
##
## data: x
## weights: wfile
## number of simulations + 1: 1000
##
## statistic = 0.19221, observed rank = 1000, p-value <
## 0.00000000000000022
## alternative hypothesis: two.sided
Spatial Lag and Spatial Error Regression Results
Spatial Lag Regression
Table 2 below provides the results of the Spatial Lag Model (SLM).
Table 2: Spatial Lag Regession Results**
##
## Call:lagsarlm(formula = hardcost_per_unit ~ latest_score + lTotal.PIC.Units +
## PUM.Contract.Rent + FMR.2.BR + pct_01br + pct_converted +
## pct_black + pct_owners + pct_vacant + urbanicity, data = rad_south_sf_3857,
## listw = weight_matrix_25, zero.policy = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -119416.3 -25118.2 -9025.6 14788.4 239594.6
##
## Type: lag
## Regions with no neighbours included:
## 132 214 216 291 306 318 323 344 356 374 376 382 390 399 401 920 929 936 950 966 1058 1189 1200 1209 1334 1335 1351 1966 2038 2127 2234 2243 3044 3053 3083 3084 3092 3216 3337 3362 5216 5225 5235 5246 5254 5255 5270 5408 5451 5452 5492 5519 5638 5727 5737 5778 5788 5930
## Coefficients: (numerical Hessian approximate standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 35596.363 24613.524 1.4462 0.148118
## latest_score -487.448 146.262 -3.3327 0.000860
## lTotal.PIC.Units 6197.466 2465.436 2.5137 0.011946
## PUM.Contract.Rent 71.232 24.993 2.8501 0.004371
## FMR.2.BR -11.523 13.392 -0.8604 0.389580
## pct_01br 5965.182 6261.856 0.9526 0.340782
## pct_converted -54027.701 6363.352 -8.4904 < 0.00000000000000022
## pct_black 11557.224 7112.121 1.6250 0.104162
## pct_owners 5363.998 11818.068 0.4539 0.649914
## pct_vacant 27781.717 24166.401 1.1496 0.250308
## urbanicity 6057.606 5149.243 1.1764 0.239432
##
## Rho: 0.31789, LR test value: 47.41, p-value: 0.0000000000057586
## Approximate (numerical Hessian) standard error: 0.043411
## z-value: 7.3227, p-value: 0.00000000000024292
## Wald statistic: 53.622, p-value: 0.00000000000024303
##
## Log likelihood: -6564.855 for lag model
## ML residual variance (sigma squared): 1729500000, (sigma: 41587)
## Number of observations: 544
## Number of parameters estimated: 13
## AIC: 13156, (AIC for lm: 13201)
The \(\rho\) coefficient of 0.3178 for \(w\_hardcost\_per\_unit\) is significant with a very low p-value which leads to a rejection of the null hypothesis that there is no spatial autocorrelation and that the spatial lag model does not reflect an improved specification. This indicates that nearby observations have similarly situated hard cost spending per unit.
Four of the five statistically significant variables in the OLS model are still significant (most recent inspection score, log transformation of the number of public housing units, the post-conversion contract rent offered and the percentage of public housing units converted at the site), with the percentage of black population in the census tract no longer being significant.
The following table compares the results for significant predictors in the OLS model with the Spatial Lag Model (SLM). Two coefficient increased, while two coefficients decreased.
| Variable | OLS Coefficient | SLM Coefficient | Comparison |
|---|---|---|---|
| latest_score | -508.21 | -487.448 | Slight decrease |
| lTotal.PIC.Units | 6940.65 | 6197.466 | Moderate decrease |
| PUM.Contract.Rent | 68.45 | 71.232 | Slight increase |
| pct_converted | -59740.06 | -54027.701 | Moderate decrease |
| pct_black. | 18735.31 | 11557.224 | No longer significant |
The Breusch-Pagan and Koenker-Bassett Test both have low p-values which leads to a rejection of the null hypothesis of homoskedasticity for the Spatial Lag Model. The SLM model still did not eliminate all heteroskedasticity.
##
## Breusch-Pagan test
##
## data:
## BP = 78.099, df = 10, p-value = 0.000000000001183
##
## studentized Breusch-Pagan test
##
## data:
## BP = 24.526, df = 10, p-value = 0.00632
The AIC for the SLM model in Table 2 (13156) is less than the OLS model in Table 1 (13201) which indicates the SLM model fits the data better than the OLS model. The Log Likelihood (-6564.855) is less negative than the OLS model (-6588.560), which indicates that the addition of the spatially lagged dependent variable is an improvement. The Log Likelihood Ratio is also significant at the \(\alpha=5\%\) level, which suggests the null hypothesis that the OLS model is better is rejected.
##
## Likelihood ratio for spatial linear models
##
## data:
## Likelihood ratio = 47.41, df = 1, p-value = 0.000000000005759
## sample estimates:
## Log likelihood of lagreg25 Log likelihood of reg
## -6564.855 -6588.560
The Moran’s I scatter plot below indicates a flat line indicating no spatial autocorrelation and certainly less than that of the OLS residuals. This is further supported by the Moran’s I of .00162541 that is not significant with 999 permutations.
##
## Monte-Carlo simulation of Moran I
##
## data: x
## weights: wfile
## number of simulations + 1: 1000
##
## statistic = 0.0014521, observed rank = 590, p-value = 0.82
## alternative hypothesis: two.sided
In all these criteria, the Spatial Lag Model performs better than OLS.
Spatial Error Regression
Table 2 below provides the results of the Spatial Error Model (SEM).
Table 3: Spatial Error Regression Results**
##
## Call:
## errorsarlm(formula = hardcost_per_unit ~ latest_score + lTotal.PIC.Units +
## PUM.Contract.Rent + FMR.2.BR + pct_01br + pct_converted +
## pct_black + pct_owners + pct_vacant + urbanicity, data = rad_south_sf_3857,
## listw = weight_matrix_25, zero.policy = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -113681.0 -25149.0 -9963.4 15888.6 229424.9
##
## Type: error
## Regions with no neighbours included:
## 132 214 216 291 306 318 323 344 356 374 376 382 390 399 401 920 929 936 950 966 1058 1189 1200 1209 1334 1335 1351 1966 2038 2127 2234 2243 3044 3053 3083 3084 3092 3216 3337 3362 5216 5225 5235 5246 5254 5255 5270 5408 5451 5452 5492 5519 5638 5727 5737 5778 5788 5930
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 34246.2677 28128.6016 1.2175 0.223418
## latest_score -570.2427 157.6534 -3.6171 0.000298
## lTotal.PIC.Units 6505.2698 2655.2363 2.4500 0.014287
## PUM.Contract.Rent 77.2892 26.5962 2.9060 0.003661
## FMR.2.BR -6.6643 17.9105 -0.3721 0.709828
## pct_01br 3959.3009 6418.0241 0.6169 0.537298
## pct_converted -44894.2830 6143.1236 -7.3081 0.0000000000002711
## pct_black 14501.3478 8071.1301 1.7967 0.072384
## pct_owners 4580.1290 11948.3105 0.3833 0.701476
## pct_vacant 21176.3707 24934.8459 0.8493 0.395732
## urbanicity 10526.9254 6356.6005 1.6561 0.097709
##
## Lambda: 0.35109, LR test value: 40.116, p-value: 0.00000000023937
## Approximate (numerical Hessian) standard error: 0.050661
## z-value: 6.9303, p-value: 0.0000000000042009
## Wald statistic: 48.028, p-value: 0.0000000000042009
##
## Log likelihood: -6568.502 for error model
## ML residual variance (sigma squared): 1741600000, (sigma: 41732)
## Number of observations: 544
## Number of parameters estimated: 13
## AIC: 13163, (AIC for lm: 13201)
The \(\lambda\) coefficient of 0.35109 and is significant with a vary low p-value which leads to a rejection of the null hypothesis that there is no spatial autocorrelation in the residual. This indicates that there is spatial structure in the residual that is not accounted by the predictors in the OLS model.
Four of the five statistically significant variables in the OLS model are still significant (most recent inspection score, log transformation of the number of public housing units, the post-conversion contract rent offered and the percentage of public housing units converted at the site), with the percentage of black population in the census tract no longer being significant.
The following table compares the results for significant predictors in the OLS model with the Spatial Lag Model (SLM). Two coefficient increased (one significantly), while two coefficients decreased (one significantly).
| Variable | OLS Coefficient | SEM Coefficient | Comparison |
|---|---|---|---|
| latest_score | -508.21 | -570.2427 | Signficant increase |
| lTotal.PIC.Units | 6940.65 | 6505.2698 | Moderate decrease |
| PUM.Contract.Rent | 68.45 | 77.2892 | moderate increase |
| pct_converted | -59740.06 | -44894.2830 | Significant decrease |
| pct_black. | 18735.31 | 14501.3478 | No longer significant |
The Breusch-Pagan and Koenker-Bassett Test both have low p-values which leads to a rejection of the null hypothesis of homoskedasticity for the Spatial Error Model. The SEM model, also, did not eliminate all heteroskedasticity.
##
## Breusch-Pagan test
##
## data:
## BP = 92.241, df = 10, p-value = 0.000000000000001887
##
## studentized Breusch-Pagan test
##
## data:
## BP = 32.283, df = 10, p-value = 0.0003592
The AIC for the SEM model in Table 3 (13163) is less than the OLS model in Table 1 (13201) which indicates the SEM model fits the data better than the OLS model. The Log Likelihood (-6568.502) is less negative than the OLS model (-6588.560), which indicates that the addition of the spatially lagged residual is an improvement. The Log Likelihood Ratio is also significant at the \(\alpha=5\%\) level, which suggests the null hypothesis that the OLS model is better is rejected.
##
## Likelihood ratio for spatial linear models
##
## data:
## Likelihood ratio = 40.116, df = 1, p-value = 0.0000000002394
## sample estimates:
## Log likelihood of errreg Log likelihood of reg
## -6568.502 -6588.560
The Moran’s I scatter plot below indicates a flat line indicating no spatial autocorrelation and certainly less than that of the OLS residuals. This is further supported by the Moran’s I of -0.0044 that is not significant with 999 permutations.
##
## Monte-Carlo simulation of Moran I
##
## data: x
## weights: wfile
## number of simulations + 1: 1000
##
## statistic = -0.0043904, observed rank = 503, p-value = 0.994
## alternative hypothesis: two.sided
Based on all these factors, the Spatial Error Model is performing better than the OLS model.
Comparing to each other, both SLM and SEM are improvements on OLS, but still exhibit the presence of additional spatial autocorrelation. This is likely due to the selection of a weight matrix that was smaller than the amount that would reduce spatial autocorrelation in the observations. The models are only accounting for neighbors within 25 miles when there are (at least) neighbors 100 miles further for which there is spatial autocorrelation.
The following table illustrates that the Spatial Lag Model has the slight edge over the Spatial Error Model based on a slightly lower AIC and the presence of less spatial autocorrelation in the residual.
| Criteria | Spatial Lag Model | Spatial Error Model | Conclusion |
|---|---|---|---|
| AIC | 13156 | 13163 | Spatial Lag Model performs slightly better |
| Residual Moran’s I | 0.0014521 | -0.0043904 | Spatial Lag model has less spatial autocorrelation |
Geographically Weighted Regression Results
As describe in the Methods section, R was used to to carry out the bandwidth selection for GWRS using the adaptive approach. Based on minimizing the AIC, the recommended bandwidth is 11.8% of observations, or 64.
## Bandwidth: 0.381966 AIC: 13181.51
## Bandwidth: 0.618034 AIC: 13191.74
## Bandwidth: 0.236068 AIC: 13173.3
## Bandwidth: 0.145898 AIC: 13165.49
## Bandwidth: 0.09016994 AIC: 13165.32
## Bandwidth: 0.1154878 AIC: 13162.34
## Bandwidth: 0.117666 AIC: 13161.83
## Bandwidth: 0.1284497 AIC: 13162.18
## Bandwidth: 0.1222643 AIC: 13162.1
## Bandwidth: 0.1194224 AIC: 13162.05
## Bandwidth: 0.116834 AIC: 13162.03
## Bandwidth: 0.1181122 AIC: 13161.88
## Bandwidth: 0.1177067 AIC: 13161.84
## Bandwidth: 0.1173482 AIC: 13161.9
## Bandwidth: 0.1175446 AIC: 13161.86
## Bandwidth: 0.1176253 AIC: 13161.84
## Bandwidth: 0.117666 AIC: 13161.83
For comparison purposes, we run an fixed bandwidth. Based on minimizing the AIC, the recommended fixed bandwidth is 385,973 or 239.8 miles which is nearly 10 times the distance used for preceding spatial analyses.
## Bandwidth: 1451382 AIC: 13195.87
## Bandwidth: 2346041 AIC: 13198.83
## Bandwidth: 898452.4 AIC: 13192.47
## Bandwidth: 556723.2 AIC: 13189.34
## Bandwidth: 345522.9 AIC: 13187.23
## Bandwidth: 214993.9 AIC: 13206.81
## Bandwidth: 426194.2 AIC: 13187.08
## Bandwidth: 396194.3 AIC: 13186.81
## Bandwidth: 390200.7 AIC: 13186.79
## Bandwidth: 384002.7 AIC: 13186.79
## Bandwidth: 385934.8 AIC: 13186.79
## Bandwidth: 385994.6 AIC: 13186.79
## Bandwidth: 385973.3 AIC: 13186.79
## Bandwidth: 385973.4 AIC: 13186.79
## Bandwidth: 385973.5 AIC: 13186.79
## Bandwidth: 385973.4 AIC: 13186.79
## Bandwidth: 385973.3 AIC: 13186.79
## Bandwidth: 385973.3 AIC: 13186.79
## Bandwidth: 385973.4 AIC: 13186.79
## Bandwidth: 385973.4 AIC: 13186.79
Below are the GWR results using the adaptive bandwidth generated through the AIC minimization approach. The GWR model has a \(R^2\) of 0.4131648, which indicates it is explaining 16% more variation in hard cost spending per unit than the OLS \(R^2\) of 0.2534. Unfortunately, the condition number for the model could not be determined in order to assess whether there is significant multicollinearity to the point of rendering the use of GWR problematic.
Table 4: Geographically Weighted Regression Results
## Call:
## gwr(formula = hardcost_per_unit ~ latest_score + lTotal.PIC.Units +
## PUM.Contract.Rent + FMR.2.BR + pct_01br + pct_converted +
## pct_black + pct_owners + pct_vacant + urbanicity, data = rad_south_sf_3857s,
## gweight = gwr.Gauss, adapt = bw, hatmatrix = TRUE, se.fit = TRUE)
## Kernel function: gwr.Gauss
## Adaptive quantile: 0.117666 (about 64 of 544 data points)
## Summary of GWR coefficient estimates at data points:
## Min. 1st Qu. Median 3rd Qu.
## X.Intercept. -80800.80636 -7597.10929 27664.35889 46711.00002
## latest_score -933.98712 -670.97199 -554.30474 -489.31860
## lTotal.PIC.Units -10787.67899 1520.48056 4935.69979 9527.29268
## PUM.Contract.Rent 32.20415 56.20936 87.78761 103.98364
## FMR.2.BR -84.13565 -0.57498 44.45732 102.65491
## pct_01br -11932.10688 3894.34891 9605.10537 12306.00275
## pct_converted -80586.60548 -70283.89796 -59836.18604 -47583.49816
## pct_black -20242.12552 1123.91940 7373.36548 26258.97156
## pct_owners -31426.70926 -19041.81877 -9911.69325 3743.99090
## pct_vacant -17557.54230 -1092.04033 12125.72073 31170.48102
## urbanicity -9003.57037 -2434.58818 1718.63991 7913.40250
## Max. Global
## X.Intercept. 130484.97441 46614.638
## latest_score -288.16600 -508.214
## lTotal.PIC.Units 15842.05314 6940.654
## PUM.Contract.Rent 174.94389 68.454
## FMR.2.BR 138.88833 -13.174
## pct_01br 16835.55620 8803.390
## pct_converted -28469.99977 -59740.056
## pct_black 51863.39091 18735.305
## pct_owners 58005.15362 4642.169
## pct_vacant 190953.11716 16692.013
## urbanicity 16807.58788 10452.453
## Number of data points: 544
## Effective number of parameters (residual: 2traceS - traceS'S): 68.98078
## Effective degrees of freedom (residual: 2traceS - traceS'S): 475.0192
## Sigma (residual: 2traceS - traceS'S): 41762.7
## Effective number of parameters (model: traceS): 51.20744
## Effective degrees of freedom (model: traceS): 492.7926
## Sigma (model: traceS): 41002.66
## Sigma (ML): 39025.16
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 13161.83
## AIC (GWR p. 96, eq. 4.22): 13097.31
## Residual sum of squares: 828491811147
## Quasi-global R2: 0.4131648
# Assuming you have the design matrix X
X <- model.matrix(~ latest_score +
lTotal.PIC.Units +
PUM.Contract.Rent +
FMR.2.BR +
pct_01br +
pct_converted +
pct_black +
pct_owners +
pct_vacant +
urbanicity,
data=rad_south_sf_3857s)
# Perform Singular Value Decomposition (SVD)
svd_result <- svd(X)
# Extract the singular values
singular_values <- svd_result$d
# Calculate the condition number
condition_number <- max(singular_values) / min(singular_values)
# Print the condition number
print(condition_number)## [1] 14589.91
The table compares AIC and Moran’s I across all 4 models, and is followed by the group of Moran’s I plots and significant test for the standardized residuals. GWR fits the better data, but doesn’t appear to remedy all spatial autocorrelation even when it expands the general neighborhood of each observation.
| Criteria | OLS | SLM | SEM | GWR | Conclusion |
|---|---|---|---|---|---|
| AIC | 13201.1191986 | 13156 | 13163 | 13097.31 | GWR fits the data better |
| Residual Moran’s I | 0.2801 | 0.0014521 | -0.0043904 | 0.1325 | Spatial Lag model has the least spatial autocorrelation |
##
## Monte-Carlo simulation of Moran I
##
## data: x
## weights: wfile
## number of simulations + 1: 1000
##
## statistic = 0.13246, observed rank = 1000, p-value <
## 0.00000000000000022
## alternative hypothesis: two.sided
The following map shows the standardized residuals across the study area. There is a cluster of extreme values in Atlanta and Nashville, but no otherwise apparent patterns.
Table 4 examines the local GWR coefficients in contrast to the original OLS estimates under the “Global”. There is significant variation in the intercept, but it was and remains insignificant. Beyond that, there is noticeable local variation in the coefficient for the log transformation of the number of units.
The resulting map of the standardized shows the spatial location of the coefficients.
- latest_score: Lower inspection scores ae consistently leading to higher per-unit spending, with a lower impact in western North Carolina.
- lTotal.PIC.Units: In North. Carolina, increases in the size of public housing sites lead to less funding (but in a decreasing rate). Everywhere else, they lead to more funding.
- PUM.Contract.Rent: The Southeastern part of the map experiences a higher impact of increased contract rents in the amount of investment received by a property.
- pct_01br: Smaller bedroom units are more of a negative factor in Tennessee and Northern Alabama and Mississippi.
- pct_converted: The percentage converted is a significant negative factor everywhere but less in Western North Carolina.
- pct_black & pct_owners: The black percentage and ownership rates of the census tract is most significant at the nexus of Texas, Louisiana and Arkansas.
- pct_vacant: Vacant housing rates are most significant in North Carolina and Virginia.
- Urbanicity: Urbanicity is most significant along the border of Texas and Mexico.
The final maps presents the local \(R^2\) calculations that range from 0.2 to 0.3 (Atlanta, Charlotte, Knoxville, Nashville) to 0.5 to 0.6 (Mississippi). This reflects where the 10 predictors do a better job of accounting for variation.
Spatial Relationship in Inspection Score
This additional analysis shows that there is significant autocorrelation in the latest score independent variable. Clustering could be a surce of multi-collinearity that poses as a problem with GWR.
##
## Monte-Carlo simulation of Moran I
##
## data: x
## weights: wfile
## number of simulations + 1: 1000
##
## statistic = 0.25627, observed rank = 1000, p-value <
## 0.00000000000000022
## alternative hypothesis: two.sided
Discussion
In this report, I examined the spatial autocorrelation of the amount of hard cost investment allocated to public housing sites that have had some portion of their public housing units converted long-term Section 8 rental assistance and removed from the public housing inventory under the Rental Assistance Demonstration. Building on Report #1, the research question resolved around whether their physical inspection scores is a significant driver of the investment level. Spatial autocorrelation is theoretically important, because properties could have spatial patterns of their physical conditions, in areas with different market attractiveness attractive that receive less comparative financing that make the OLS model results unreliable. Indeed, the determination of the observations, Moran’s I, Local Moran’s I, and the significant testing of the Spatial Lag, Spatial Error and Geographically Weighted Regression models support the presence of autocorrelation, whether it is from nearby investment in properties affecting each other or in additional spatial structure in the variation in hard cost spending not being accounted for by the selected predictors. The proximity of the AIC for SLM and SEM suggest that both types of spatial autocorrelation could be present. The GWR accounts for a marked increase in explaining variation and fit the data better than SLM and SEM, but it also did not account for all of the spatial autocorrelation.
In all measures, inspection score remains a significant driver of hard cost spending when accounting for spatial correlation. More funding is going to projects with greater need. With the SEM, model it increases above 5% over the OLS model, suggesting there is an unobserved spatial aspect of the data that is connected to it. Across all observations the GWR model finds this factor to to have a major effect on the variation of the hard cost spending per unit.
| OLS Coefficient | SLM Coefficient | SEM Coefficient | GWR Range |
|---|---|---|---|
| -508.21 | -487.448 | -570.2427 | (-933.98712, -288.16600) |
We also know that the models confirm that the level of subsidy offered to a converting project and the original size of the property are also significant drivers in how much investment is allocated. High contract rents are associated with higher allocation of funding, more than a particular property would “need” when accounting for it’s location.
Limitations
There are several limitations of this study. The Spatial Lag and Spatial Error models both indicated the presence of continuing heteroskedasticity. The nature of local implementation of RAD conversions weakens the assumption of independence of observations, farther than what spatial autocorrelation could account for. A hierarchical model may be better for capturing that relationship. As mentioned, these models also used a weight matrix based on 25 miles that was shown by GWR to be insufficient to capture the bandwidth with which spatial autocorrelation is fully accounted for (~ 240 miles).
In addition, the use of census tract features for contextualizing point data creates a source of spatial autocorrelation through clustering and multicollinearity. Data that captures neighborhood conditions in a more granular level would be better.
During the data preparation, it was observed that there is a distinct pattern of “non-construction” conversions with no hardcost per spending that has been reported to be as high as 18-26% in program evaluations Stout et al. (2019). The presence of these $0 spending observations (and others that are nominally $0) pulls down the average spending per unit and it could be argued that they should be omitted. They are also, the literature says, have a degree of better physical conditions which conforms to the conceptual framework of inspection scores driving the level of investment. A better balanced approach, might account for these projects using a dummy variable.
Lastly, because of the grouping of units from multiple public housing sites, the hard cost spending unit for the sending public housing sites had to be approximated from a weighted averages because the construction spending is only reported for the destination transactions.
Note on Methods
There is common confusion between weighted lagged residual and spatial lag residuals. In the latter you include the spatial leg of the dependent variable in the model specification, and the residual is calculated in a standard fashion (predicted values - observed values) with spatial autocorrelation reflected. In the the weighted residuals model, you are applying the weight matrix in adjusting each residual based on it’s relationship with it’s neighboring residuals.
ArcGIS was not used in this report, and it is generally considered to be problematic for GWR. The method for optimizing bandwidth has changed in more recent models and can create unnecessary confusion. When you use the Golden Search approach that is expected to select a bandwidth with the minimum AICc (for small samples), it can return a bandwidth that does not actually minimize AICc. In past and current version of ArcGIS/ArcMap, the calculation of local \(R^2\) can also result in negative values which are incorrect.
Appendix 1: Model Variables
Dependent Variables:
- \(hardcost\_per\_unit\) = The total hard construction costs for RAD projects matched to the public housing site divided by the total number of units (RAD or not). This is the dependent variable in the model specification, meaning that it’s variation will be explained by the variation of all other variables
Independent Variables:
- \(inspection\_score\) = The latest inspection score associated with the public housing site from the outset of the program (2011)
- \(FMR.2.BR\) = HUD’s Fair Market Rent associated with a 2BR rental unit for the geographic area containing the public housing site
- \(log(Total.PIC.Units)\) = The log transformation of the total pre-RAD number of public housing units at the site
- \(PUM.Contract.Rent\) = The preliminary contract rent guaranteed to a converting public housing site in 2012 following conversion under RAD.
- \(pct\_01br\) = The percentage of 0 and 1BR public housing units at the site (0-100)
- \(pct\_converted\) = The number of public housing units at the site divided by \(Total.PIC.Units\)
- \(pct\_black\) = The black population percentage for the census tract containing the public housing site (0-100)
- \(pct\_owners\) = The percentage of owner-occupied housing units for the census tract containing the public housing site (0-100)
- \(pct\_vacant\) = The percentage of vacant housing units for the census tract containing the public housing site (0-100)
- \(urbanicity\) = 2010 urbanicity measure from Department of Agriculture with 1-3 as urban and 4-10 as rural
Spatial Independent Variables:
- \(Wy\) = Mean value of observations that are neighbors of a specific observation in the Spatial Lag Model.
- \(W\epsilon\) = Mean value of residuals that are neighbors of a specific observation’s residual in Spatial Error Model.
Regression Coefficients:
- \(\beta_0\): This is the regression coefficient for the intercept value, which can be interpreted as the predicted amount of hard construction costs (per unit) when all the independent variables above are 0. Several of these variables (such as inspection score, total PIC units and the FMR of 2BR units) have no practical meaning at the 0-level so this coefficient has no interpretative value.
- \(\beta_1\): This reflects the increase in amount of hard cost spending per unit associated with a 1-point increase in the inspection score.
- \(\beta_2\): This reflects the increase in the amount of hard cost spending per unit associated with a $1 increase in the FMR of 2BR unit applicable to the public housing site (which is assumed to translate to a proportional increase in the FMRs for all other bedroom sizes).
- \(\beta_3\): This is the regression coefficient for the log transformation of the total number of public housing units at the original public housing site. For every 1% increase in the number of units, the construction spending per unit will increase by \(B_3 * ln \frac{101}{100}\).
- \(\beta_4\): This reflects the increase in the amount of hard costs per unit when the preliminary contract rent for a converting site goes up by $1.
- \(\beta_5\): This reflects the increase in the amount of hard cost per unit spending associated with a 1-point increase in the the percentage of 0BR and 1BR apartment units in the original public housing site.
- \(\beta_6\): The reflects the increase in amount of hard cost per unit spending associated with a 1% increase in the proportion of original units converted at a site.
- \(\beta_7\): This is the regression coefficient for log transformation of the black population in the census tract containing the public housing site.1% increase in the black population translates to a For every 1% increase in the number of units, the construction spending per unit will increase by \(B_5 * ln \frac{101}{100}\) increase in the construction spending per unit.
- \(\beta_8\): This reflects the increase in the hard cost per unit associated with a 1% increase in the homeownership rate for the census tract containing the public housing site.
- \(\beta_9\): This reflects the increase in the hard cost per unit associated with a 1% increase in the vacancy rate for the census tract containing the public housing site.
- \(\beta_{10}\): This coefficient for the dummy variable of urbanicity reflects the increase in hard cost spending per unit for observations that are in rural areas compared to urban areas.
Spatial Coefficients
- \(\rho\): The coefficient of the y-lag variable \(Wy\) that ranges from -1 to 1 and reflects how a 1-point change in the lagged variable is associated with changes in the original observation’s value.
- \(\lambda\): The coefficient of the lagged residual \(W \epsilon\) that ranges from -1 to 1 and reflects how a 1-point change in the neighboring residuals is associated with changes in the observations residual.
Error Term:
- \(\epsilon\): The error term in the regression model reflects the variation in the independent variable that is not explained by the changes in the independent variables.
Spatial Error Term
- \(u\): The error term in the second stage of the Spatial Error Model that (should) reflect the random the variation in the independent variable that is not explained by both the changes in the independent variables and the spatial autocorrelation.
Appendix 2: OLS Descriptive Statistics for Study Area
Descriptive Statistics
| varname | type | n | mean | sd |
|---|---|---|---|---|
| hardcost_per_unit | DV | 544 | 37095.9080247 | 50990.1299459 |
| pct_01br | IV | 544 | 0.3347926 | 0.3119880 |
| FMR.2.BR | IV | 544 | 732.0551471 | 178.7839954 |
| Total.PIC.Units | IV | 544 | 163.8915441 | 140.5411313 |
| PUM.Contract.Rent | IV | 544 | 556.0755882 | 94.8843974 |
| pct_converted | IV | 544 | 0.8949046 | 0.2985923 |
| urbanicity | IV | 544 | 0.7996324 | 0.4006439 |
| pct_black | IV | 544 | 0.4413803 | 0.3060956 |
| pct_owners | IV | 544 | 0.4307117 | 0.1918004 |
| pct_vacant | IV | 544 | 0.1504956 | 0.0789507 |